OAK-11568 Elastic: improved compatibility for aggregation definitions #2193
Conversation
thomasmueller commented Mar 19, 2025 (edited)
- Analyzer configuration is now lenient, quite similar to the Lucene index behavior. This will allow converting Lucene indexes to Elasticsearch. Warnings are logged where needed.
- This PR also removes unused code, and reduces compiler warnings.
- The tests in ElasticIndexHelperTest cover error handling when loading files that are not configured (IllegalStateException etc.).
- The tests in FullTextAnalyzerCommonTest cover compatibility problems.
- With the NGram tokenizer (not the filter), behaviour differs between Elastic and Lucene in one case: if the query contains multiple words, Lucene finds the result but Elastic does not.
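One plausible way to picture the multi-word mismatch: an NGram tokenizer runs over the raw input, spaces included, so it emits grams that span word boundaries, while a query analyzed word by word never produces those grams. The sketch below is standalone illustration code (hypothetical class and method names, not Oak or Elasticsearch code) and only demonstrates the gram generation, not the exact root cause inside either engine.

```java
import java.util.ArrayList;
import java.util.List;

// Standalone sketch: generate all n-grams of the raw input string,
// including grams that span the space between words.
public class NGramSketch {

    static List<String> ngrams(String text, int min, int max) {
        List<String> out = new ArrayList<>();
        for (int len = min; len <= max; len++) {
            for (int i = 0; i + len <= text.length(); i++) {
                out.add(text.substring(i, i + len));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Grams such as "o b" exist only when the tokenizer sees the full
        // string "foo bar"; analyzing "foo" and "bar" separately can never
        // produce them, so index-side and query-side grams may not line up.
        System.out.println(ngrams("foo bar", 2, 3));
    }
}
```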
if ("n_gram".equals(name)) {
    // OAK-11568
    // https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-ngram-tokenizer.html
    Integer minGramSize = getIntegerSetting(args, "minGramSize", 2);
    Integer maxGramSize = getIntegerSetting(args, "maxGramSize", 3);
    return TokenizerDefinition.of(t -> t.ngram(
            NGramTokenizer.of(n -> n.minGram(minGramSize).maxGram(maxGramSize))));
}
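The helper `getIntegerSetting` is referenced but not shown in this hunk. A plausible sketch of what such a lenient helper might look like (hypothetical implementation, not the actual Oak code) is: read the analyzer argument, and fall back to the default when the entry is missing or not a parseable number, matching the lenient behavior this PR aims for.

```java
import java.util.Map;

// Hypothetical sketch of a lenient integer-setting reader: missing or
// malformed values fall back to the default instead of failing.
public class SettingSketch {

    static Integer getIntegerSetting(Map<String, Object> args, String name, int defaultValue) {
        Object value = args.get(name);
        if (value == null) {
            return defaultValue;
        }
        try {
            return Integer.parseInt(value.toString());
        } catch (NumberFormatException e) {
            // lenient: fall back rather than fail the index definition
            return defaultValue;
        }
    }

    public static void main(String[] args) {
        System.out.println(getIntegerSetting(Map.of("minGramSize", "4"), "minGramSize", 2));
    }
}
```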
This is okay for now. We should structure it better to cover all the possible tokenizers (https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-tokenizers.html). This can go in a separate PR.
Yes, I agree!
name = "hyphenation_decompounder";
String hyphenator = args.getOrDefault("hyphenator", "").toString();
LOG.info("Using the hyphenation_decompounder: {}", hyphenator);
args.put("hyphenation_patterns_path", "analysis/hyphenation_patterns.xml");
Should "analysis/hyphenation_patterns.xml" be installed on the Elastic nodes?
I wanted to use a fixed name, so it is possible to configure it. Installing this would have to be done manually, and we need to document it.
if (skipEntry) {
    continue;
}
String key = name + "_" + i;
filters.put(key, factory.apply(name, JsonData.of(args)));
if (name.equals("word_delimiter_graph")) {
    wordDelimiterFilterKey = key;
} else if (name.equals("synonym")) {
    if (wordDelimiterFilterKey != null) {
        LOG.info("Removing word delimiter because there is a synonyms filter as well: " + wordDelimiterFilterKey);
        filters.remove(wordDelimiterFilterKey);
    }
}
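The conflict rule above can be exercised in isolation. This is a standalone sketch (hypothetical names, not Oak code) that keeps filters in insertion order and drops the previously added word delimiter entry once a synonym filter appears, mirroring the removal logic in the hunk:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Standalone sketch of the filter-conflict rule: a synonym filter does not
// compose with word_delimiter_graph, so the delimiter entry is removed.
public class FilterChainSketch {

    static Map<String, String> buildChain(List<String> filterNames) {
        Map<String, String> filters = new LinkedHashMap<>(); // preserves filter order
        String wordDelimiterFilterKey = null;
        int i = 0;
        for (String name : filterNames) {
            String key = name + "_" + i++;
            filters.put(key, name);
            if (name.equals("word_delimiter_graph")) {
                wordDelimiterFilterKey = key;
            } else if (name.equals("synonym") && wordDelimiterFilterKey != null) {
                filters.remove(wordDelimiterFilterKey);
            }
        }
        return filters;
    }

    public static void main(String[] args) {
        System.out.println(buildChain(List.of("word_delimiter_graph", "synonym")).keySet());
    }
}
```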
Another option could be the use of https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-multiplexer-tokenfilter.html
We can work on this in a separate PR.
Yes, I also thought about that, but I haven't found good documentation on it yet.